Abstract:Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%--31% and energy consumption by 19%--27%. Under a KV cache budget of 1024, it achieves up to 54%--74% latency reduction and 45%--67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.
Abstract:Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
Abstract:Real-time transmission of visual data over wireless networks remains highly challenging, even when leveraging advanced deep neural networks, particularly under severe channel conditions such as limited bandwidth and weak connectivity. In this paper, we propose a novel Resilient Tokenization-Enabled (ResiTok) framework designed for ultra-low-rate image transmission that achieves exceptional robustness while maintaining high reconstruction quality. By reorganizing visual information into hierarchical token groups consisting of essential key tokens and supplementary detail tokens, ResiTok enables progressive encoding and graceful degradation of visual quality under constrained channel conditions. A key contribution is our resilient 1D tokenization method integrated with a specialized zero-out training strategy, which systematically simulates token loss during training, empowering the neural network to effectively compress and reconstruct images from incomplete token sets. Furthermore, the channel-adaptive coding and modulation design dynamically allocates coding resources according to prevailing channel conditions, yielding superior semantic fidelity and structural consistency even at extremely low channel bandwidth ratios. Evaluation results demonstrate that ResiTok outperforms state-of-the-art methods in both semantic similarity and visual quality, with significant advantages under challenging channel conditions.
Abstract:Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within the practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become redundant.Motivated by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further demonstrations.Extensive experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at https://github.com/HITsz-TMG/PICA.
Abstract:To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a \textbf{Vi}sual-Centric \textbf{S}election approach via \textbf{A}gents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5\% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at https://github.com/HITsz-TMG/ViSA.
Abstract:This paper explores the concept of information importance in multi-modal task-oriented semantic communication systems, emphasizing the need for high accuracy and efficiency to fulfill task-specific objectives. At the transmitter, generative AI (GenAI) is employed to partition visual data objects into semantic segments, each representing distinct, task-relevant information. These segments are subsequently encoded into tokens, enabling precise and adaptive transmission control. Building on this frame work, we present importance-aware source and channel coding strategies that dynamically adjust to varying levels of significance at the segment, token, and bit levels. The proposed strategies prioritize high fidelity for essential information while permitting controlled distortion for less critical elements, optimizing overall resource utilization. Furthermore, we address the source-channel coding challenge in semantic multiuser systems, particularly in multicast scenarios, where segment importance varies among receivers. To tackle these challenges, we propose solutions such as rate-splitting coded progressive transmission, ensuring flexibility and robustness in task-specific semantic communication.
Abstract:The uncertainty of the sensing target brings great challenge to the beamforming design of the integrated sensing and communication (ISAC) system. To address this issue, we model the scattering coefficient and azimuth angle of the target as random variables and introduce a novel metric, expected detection probability (EPd), to quantify the average detection performance from a Bayesian perspective. Furthermore, we design a Bayesian beamforming scheme to optimize the expected detection probability under the limited power budget and communication performance constraints. A successive convex approximation and semidefinite relaxation-based (SCA-SDR) algorithm is developed for the complicated non-convex optimization problem corresponding to the beamforming scheme. Simulation results show that the proposed scheme outperforms other benchmarks and exhibits robust detection performance when parameters of the target are unknown and random.
Abstract:Accurate and reliable link quality prediction (LQP) is crucial for optimizing network performance, ensuring communication stability, and enhancing user experience in wireless communications. However, LQP faces significant challenges due to the dynamic and lossy nature of wireless links, which are influenced by interference, multipath effects, fading, and blockage. In this paper, we propose GAT-LLM, a novel multivariate wireless link quality prediction model that combines Large Language Models (LLMs) with Graph Attention Networks (GAT) to enable accurate and reliable multivariate LQP of wireless communications. By framing LQP as a time series prediction task and appropriately preprocessing the input data, we leverage LLMs to improve the accuracy of link quality prediction. To address the limitations of LLMs in multivariate prediction due to typically handling one-dimensional data, we integrate GAT to model interdependencies among multiple variables across different protocol layers, enhancing the model's ability to handle complex dependencies. Experimental results demonstrate that GAT-LLM significantly improves the accuracy and robustness of link quality prediction, particularly in multi-step prediction scenarios.
Abstract:As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations of the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
Abstract:The application of deep learning (DL)-based channel state information (CSI) feedback frameworks in massive multiple-input multiple-output (MIMO) systems has significantly improved reconstruction accuracy. However, the limited generalization of widely adopted autoencoder-based networks for CSI feedback challenges consistent performance under dynamic wireless channel conditions and varying communication overhead constraints. To enhance the robustness of DL-based CSI feedback across diverse channel scenarios, we propose a novel framework, ITUG, where the user equipment (UE) transmits only a selected portion of critical values in the CSI matrix, while a generative model deployed at the BS reconstructs the remaining values. Specifically, we introduce a scoring algorithm to identify important values based on amplitude and contrast, an encoding algorithm to convert these values into a bit stream for transmission using adaptive bit length and a modified Huffman codebook, and a Transformer-based generative network named TPMVNet to recover the untransmitted values based on the received important values. Experimental results demonstrate that the ITUG framework, equipped with a single TPMVNet, achieves superior reconstruction performance compared to several high-performance autoencoder models across various channel conditions.